fix: handle HTTP 413 by splitting and retrying in OTLP HTTP exporters #5032
Krishnachaitanyakc wants to merge 9 commits into open-telemetry:main from …
Conversation
Do other languages' OTLP HTTP exporters do something similar? It'd be good to see how other languages handle this.
I checked how other OTel language SDKs handle HTTP 413. Per the OTLP specification (otlp/#failures), only 429, 502, 503, and 504 are listed as retryable; 413 is not mentioned at all, and the spec says "All other 4xx or 5xx response status codes MUST NOT be retried."
No other SDK currently handles 413 with batch splitting, so this PR would make Python the first to implement this recovery strategy. Note, though, that while the spec says 4xx codes "MUST NOT be retried", splitting and retrying with a smaller payload is not the same as retrying the same request; it's a distinct recovery strategy. The current behavior in all SDKs is to silently drop the entire batch, which causes data loss.
Thank you for starting this @Krishnachaitanyakc, and for checking the spec and the other implementations. The scope of #4533 is vague, especially given the current state of OTel Python's OTLP HTTP vs. gRPC span/metrics/logs export, so I'm going to comment there.
```python
def _export_batch(
    self, batch: Sequence[ReadableLogRecord], deadline_sec: float
) -> LogRecordExportResult:
    serialized_data = encode_logs(batch).SerializeToString()
```
Under what scenarios do you need this feature? If you already know the payload limits of your backend, we could bisect based on a configurable max_payload_size, and eliminate the 413 errors entirely.
The reactive approach covers cases where the backend limit isn't known upfront or changes without SDK reconfiguration (e.g., proxies). I can tackle the proactive splitting via max_payload_size in a follow-up.
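For illustration, here is a minimal sketch of the proactive variant discussed above, assuming a configurable max_payload_size; the function and parameter names are hypothetical, not the exporter's real API:

```python
# Hypothetical sketch of proactive size-based splitting; serialize, send,
# and max_payload_size are illustrative names, not the exporter's real API.
from typing import Callable, Sequence


def export_within_limit(
    batch: Sequence,
    serialize: Callable[[Sequence], bytes],
    send: Callable[[bytes], None],
    max_payload_size: int,
) -> None:
    """Bisect the batch until each serialized chunk fits under the limit."""
    payload = serialize(batch)
    if len(payload) <= max_payload_size or len(batch) == 1:
        # A single item over the limit is sent anyway (or could be dropped).
        send(payload)
        return
    mid = len(batch) // 2
    export_within_limit(batch[:mid], serialize, send, max_payload_size)
    export_within_limit(batch[mid:], serialize, send, max_payload_size)
```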
@Krishnachaitanyakc Do you have a custom collector implementation that returns 413 error codes? From my understanding, the vanilla OpenTelemetry Collector does not return 413 errors.
@herin049 The vanilla Collector returns 400, not 413. The 413s come from real-world examples such as:
- New Relic (Dropped Items after Status Code 413 (Content Too Large) opentelemetry-collector#14298)
- Datadog ([Datadog] Payload too large opentelemetry-collector-contrib#1925)
- InfluxDB (feat(influxdbexporter): limit size of payload opentelemetry-collector-contrib#24001)
- Elasticsearch ([Exporter/Elasticsearch] Fix HTTP 413 (Payload Too Large) in Elasticsearch exporter opentelemetry-collector-contrib#46022)
The Collector's own exporter already treats 413 as a permanent error (open-telemetry/opentelemetry-collector#5674). There's also active spec work on payload limits (open-telemetry/opentelemetry-proto#782).
@Krishnachaitanyakc
Right, I'm more asking about the following:
Typically the flow for telemetry is:
OTel SDK (Python) -> OTel Collector -> Backend (New Relic, Datadog, etc.)
The only way you'd get a 413 error code is if the Collector returned it to the SDK, which, from my understanding, is not something the Collector currently does.
In that flow, yes, the vanilla Collector doesn't return 413 to the SDK.
But direct-to-backend export (no Collector) is a supported deployment pattern, and it's what the original reporter on #4533 appears to be using.
If we decide we want to support this, we should make it opt-in only, since it's not part of the spec. We can do this with an experimental environment variable or a private keyword-only argument.
…rying
When a backend returns HTTP 413 (Payload Too Large), the trace and log exporters now split the batch in half and recursively retry each half. This prevents silent data loss when batch sizes exceed backend limits. The splitting includes deadline guards to prevent infinite recursion, short-circuits on first-half failure to avoid wasting time on the second half, and drops individual items that are genuinely too large. Fixes open-telemetry#4533
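A condensed, self-contained sketch of the recursive flow this commit describes; serialize and post are stand-ins for the exporter's real encoding and HTTP code:

```python
# Sketch only: mirrors the described bisect-and-retry flow, not the real module.
from enum import Enum
from time import time
from typing import Callable, Sequence


class ExportResult(Enum):
    SUCCESS = 0
    FAILURE = 1


def export_batch(
    batch: Sequence,
    deadline_sec: float,
    serialize: Callable[[Sequence], bytes],
    post: Callable[[bytes], int],  # returns an HTTP status code
) -> ExportResult:
    status = post(serialize(batch))
    if status == 200:
        return ExportResult.SUCCESS
    if status == 413 and len(batch) > 1:
        if time() >= deadline_sec:  # deadline guard: stop recursing
            return ExportResult.FAILURE
        mid = len(batch) // 2
        first = export_batch(batch[:mid], deadline_sec, serialize, post)
        if first is ExportResult.FAILURE:
            return first  # short-circuit: skip the second half on failure
        return export_batch(batch[mid:], deadline_sec, serialize, post)
    # A single record that is still too large is dropped as non-retryable.
    return ExportResult.FAILURE
```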
- Add CHANGELOG.md entry for the 413 splitting feature
- Apply ruff format to source files (line-wrapping adjustments)
- Rename loop variable 'i' to 'idx' to satisfy the pylint naming convention
Relax assertAlmostEqual tolerance from 2 decimal places (0.005) to 1 (0.05) in timeout tests. The _export_batch refactoring adds a serialization step between deadline calculation and the HTTP POST, consuming a few extra milliseconds that exceed the tight tolerance on slow runtimes like PyPy on Windows.
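For reference, assertAlmostEqual(a, b, places=N) passes when round(a - b, N) == 0, so places=1 tolerates drift up to roughly 0.05 while places=2 only tolerates about 0.005:

```python
# Illustration of the tolerance change described above.
import unittest


class ToleranceDemo(unittest.TestCase):
    def test_places(self):
        # A 10 ms drift passes at places=1 (tolerance ~0.05) ...
        self.assertAlmostEqual(1.00, 1.01, places=1)
        # ... but would fail at places=2 (tolerance ~0.005):
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(1.00, 1.01, places=2)
```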
…l flow
- Add _MAX_BISECTS=5 to cap recursive splitting depth
- Combine the 413 guard with the len > 1 and remaining_bisects > 0 checks so a single-item 413 falls through to the existing non-retryable path
- Check self._shutdown alongside the deadline in the 413 handler
- Add tests for max bisect depth exhaustion and shutdown during 413
Add pylint disable comment, matching the pattern used in test_otlp_metrics_exporter.py.
The HTTP 413 payload splitting behavior is not part of the OpenTelemetry specification. Gate it behind the experimental environment variable OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413 (must be set to "true" to enable). When unset, 413 responses are treated as non-retryable errors. Also refactors the control flow per review feedback: the bisectable flag is computed alongside retryable, checked after the retry-exit block, and the splitting logic is moved to after line 257 in the original code.
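A hypothetical rendering of that control flow, with the flag computed alongside retryable; the helper and its parameters are illustrative, not the actual code:

```python
# Illustrative only: mirrors the described control flow, not the real module.
from os import environ


def _classify_response(status_code: int, batch_len: int, remaining_bisects: int):
    retryable = status_code in (429, 502, 503, 504)
    bisectable = (
        status_code == 413
        and batch_len > 1
        and remaining_bisects > 0
        # Opt-in gate: must be set to "true" (case handling is an assumption).
        and environ.get(
            "OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413", ""
        ).lower()
        == "true"
    )
    return retryable, bisectable
```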
- Add pylint disable for too-many-public-methods on TestOTLPSpanExporter (21 methods exceeds the limit of 20, matching the existing log exporter fix)
- Reset DuplicateFilter state between log exporter tests via setUp() to prevent log suppression from the 413 tests bleeding into test_export_no_collector_available
@herin049 can you please review this?
Opened open-telemetry/opentelemetry-specification#5022 to take the recovery-semantics question to Spec SIG. Given the current OTLP wording around non-retryable 4xx responses and the review feedback here, this seems to need either spec guidance or a clearly opt-in Python-specific behavior.
```python
_is_payload_too_large(resp)
and len(batch) > 1
and remaining_bisects > 0
and environ.get(
```
This should ideally be bound to an instance variable in the constructor (e.g., self._retry_entity_too_large).
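A sketch of that suggestion, reading the environment once in the constructor; the attribute name comes from the comment above, while the class stub and parsing details are assumptions:

```python
from os import environ


class OTLPLogExporter:  # illustrative stub, not the real class body
    def __init__(self) -> None:
        # Read the experimental flag once instead of on every export attempt.
        self._retry_entity_too_large = (
            environ.get("OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413", "").lower()
            == "true"
        )
```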
```python
if time() >= deadline_sec or self._shutdown:
    _logger.error(
        "Payload too large but %s, dropping %d log records",
        "shutdown in progress"
        if self._shutdown
        else "deadline expired",
        len(batch),
    )
    return LogRecordExportResult.FAILURE
```
This shouldn't be necessary anymore because of lines 212-215.
Summary
When a backend returns HTTP 413 (Payload Too Large), the OTLP HTTP trace and log exporters now split the batch in half and recursively retry each half, preventing silent data loss when batch sizes exceed backend limits.
Fixes #4533
Changes
- `_is_payload_too_large()` helper in `_common/__init__.py` (see the sketch below)
- `export()` refactored to delegate to `_export_batch()` in both trace and log exporters
- `_export_batch()` handles 413 responses with binary splitting: the batch is halved and each half retried recursively, with a deadline guard, a cap on bisect depth, and a short-circuit when the first half fails
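The helper itself is presumably a one-liner along these lines; the exact signature is an assumption based on requests-style responses:

```python
def _is_payload_too_large(resp) -> bool:
    # Assumed shape: a requests.Response-like object with .status_code.
    return resp.status_code == 413
```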
Notes
- The metrics exporter already supports proactive splitting via `max_export_batch_size` and `_split_metrics_data()`. Reactive 413 handling for metrics is deferred to a follow-up since metric data has a nested protobuf structure that requires different splitting logic.
- The gRPC exporters signal oversized payloads differently (`RESOURCE_EXHAUSTED`) and would need separate handling in a future PR.
Test plan